library(dplyr)
library(magrittr)
library(ggplot2)
library(coefplot)
## find the project root
root <- rprojroot::find_rstudio_root_file()
## set the data directory path
dataDir <- file.path(root, 'Data')

Introduction

We will first read data from Jared Lander’s website. The data comes from the NYC open data portal but has already been cleaned; it represents housing sales in NYC between 2011 and 2021.

housing <- readr::read_csv("http://www.jaredlander.com/data/housing.csv")

Let’s make the column names more readable:

names(housing) <- c("Neighborhood", "Class", "Units", "YearBuilt",
                    "SqFt", "Income", "IncomePerSqFt", "Expense",
                    "ExpensePerSqFt", "NetIncome", "Value", 
                    "ValuePerSqFt", "Boro")

Stats Principles

Correlation

One measure of the relationship between two variables is correlation. Correlation ranges between -1 and 1, with negative values indicating an inverse relationship and positive values a direct relationship; values near 0 indicate little linear association.
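As a sanity check, here is a minimal base-R sketch (on made-up numbers) that computes the Pearson correlation from its definition and compares it to cor:

```r
# toy vectors (made-up data)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

n <- length(x)
# Pearson correlation: sum of cross-deviations over (n-1) * sd(x) * sd(y)
r_manual <- sum((x - mean(x)) * (y - mean(y))) / ((n - 1) * sd(x) * sd(y))

all.equal(r_manual, cor(x, y))  # TRUE
```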

cor(housing$SqFt, housing$Income)
## [1] 0.8560584

We can also compute the correlation between multiple variables at once.

cor(housing %>% select(c(Income, SqFt, Units, ValuePerSqFt)))
##                 Income      SqFt     Units ValuePerSqFt
## Income       1.0000000 0.8560584 0.7557462    0.4449276
## SqFt         0.8560584 1.0000000 0.9577540    0.2148611
## Units        0.7557462 0.9577540 1.0000000    0.1435213
## ValuePerSqFt 0.4449276 0.2148611 0.1435213    1.0000000

We have already seen this type of plot; let’s plot our new data with ggpairs.

GGally::ggpairs(housing %>% select(c(Income, SqFt, Units, ValuePerSqFt)))

Covariance

Related to correlation, covariance is like a variance between variables: it measures how two (or more) sets of random variates vary together. Unlike correlation, it is not divided by the standard deviations, so the values are not constrained to the range -1 to 1.
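Since correlation is just covariance rescaled by the two standard deviations, we can verify the relationship directly (a tiny sketch on made-up numbers):

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# dividing covariance by both standard deviations recovers the correlation
all.equal(cov(x, y) / (sd(x) * sd(y)), cor(x, y))  # TRUE
```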

cov(housing$Income, housing$ValuePerSqFt)
## [1] 141382416

Like correlation, we can compare the covariance between multiple variables.

cov(housing %>% select(c(Income, SqFt, Units, ValuePerSqFt)))
##                    Income         SqFt        Units ValuePerSqFt
## Income       2.149034e+13 571556408169 4.734186e+08 1.413824e+08
## SqFt         5.715564e+11  20742831383 1.863956e+07 2.121173e+06
## Units        4.734186e+08     18639557 1.825977e+04 1.329378e+03
## ValuePerSqFt 1.413824e+08      2121173 1.329378e+03 4.698603e+03

Simple Linear Regression

Let’s take a look at square feet and possible rental income.

housing %>% select(Income, SqFt) %>% head
## # A tibble: 6 x 2
##     Income   SqFt
##      <dbl>  <dbl>
## 1  1332615  36500
## 2  6633257 126420
## 3 17310000 554174
## 4 11776313 249076
## 5 10004582 219495
## 6  5127687 139719

And we can plot the two variables and draw the regression line.

ggplot(housing, aes(x = SqFt, y = Income)) + geom_point(alpha = 1/6, shape = 1) + geom_smooth(method = 'lm') + theme_minimal() +
    xlab("Square Feet") + ylab("Income")

Let’s actually fit the model using the lm function.

We are using formula notation. On the left of the ~ is our response variable (Income), and on the right is our one predictor variable (SqFt).

The results show two coefficients: one for the intercept and one for the slope on SqFt.

sqftLM <- lm(Income ~ SqFt, data = housing)

sqftLM
## 
## Call:
## lm(formula = Income ~ SqFt, data = housing)
## 
## Coefficients:
## (Intercept)         SqFt  
##   360400.29        27.55
sqftInfo <- summary(sqftLM)
sqftInfo
## 
## Call:
## lm(formula = Income ~ SqFt, data = housing)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -55527201   -585422   -408030     99571  28392555 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.604e+05  5.394e+04   6.681 2.88e-11 ***
## SqFt        2.755e+01  3.248e-01  84.839  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2397000 on 2624 degrees of freedom
## Multiple R-squared:  0.7328, Adjusted R-squared:  0.7327 
## F-statistic:  7198 on 1 and 2624 DF,  p-value: < 2.2e-16
## MSE
sum(residuals(sqftLM)^2) / df.residual(sqftLM)
## [1] 5.743635e+12
##RMSE
sqrt(sum(residuals(sqftLM)^2) / df.residual(sqftLM))
## [1] 2396588
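Under the hood, lm solves the least-squares normal equations. Here is a minimal sketch on simulated data showing that \((X^TX)^{-1}X^Ty\) reproduces the lm coefficients:

```r
set.seed(42)
x <- runif(100, 0, 10)
y <- 3 + 2 * x + rnorm(100)                    # simulated linear relationship

X <- cbind(1, x)                               # design matrix with intercept column
beta_manual <- solve(t(X) %*% X, t(X) %*% y)   # normal equations

all.equal(as.numeric(beta_manual), as.numeric(coef(lm(y ~ x))))  # TRUE
```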
sqftCoef <- as.data.frame(sqftInfo$coefficients[,1:2])
sqftCoef <- within(sqftCoef, {
    # 80% confidence bounds: qt(p = .9) leaves 10% in each tail
    Lower <- Estimate - qt(p=.9, df = sqftInfo$df[2]) * `Std. Error`
    Upper <- Estimate + qt(p=.9, df = sqftInfo$df[2]) * `Std. Error`
    coef <- row.names(sqftCoef)
})

ggplot(sqftCoef, aes(x=Estimate, y=coef)) + geom_point() + geom_errorbarh(aes(xmin=Lower, xmax = Upper), height = .3) +
    ggtitle("Rental Comparable Income by Square Feet")
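Because qt(p = .9) puts 10% in each tail, the manual bounds above form an 80% interval; base R’s confint does the same computation. A small sketch on simulated data:

```r
set.seed(1)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)
fit <- lm(y ~ x)
s <- summary(fit)

# manual 80% interval: estimate +/- t_{0.9, df} * standard error
tcrit  <- qt(0.9, df = fit$df.residual)
manual <- cbind(s$coefficients[, "Estimate"] - tcrit * s$coefficients[, "Std. Error"],
                s$coefficients[, "Estimate"] + tcrit * s$coefficients[, "Std. Error"])

all.equal(unname(manual), unname(confint(fit, level = 0.8)))  # TRUE
```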

Regression Diagnostics

R^2

R^2 is the proportion of variance in the response explained by the model. We can compute it by hand:

#r^2
1 - sum((housing$Income-predict.lm(sqftLM, data.frame(SqFt =housing$SqFt)))^2)/
    sum((housing$Income-mean(housing$Income))^2)
## [1] 0.732836

Outliers

If we want to see the outliers, the car package has some nice tools. The function outlierTest reports the studentized residuals, which highlight the extreme outliers in our data. \[t_i = r_i \left(\frac{n-k-2}{n-k-1-r_i^2}\right)^{1/2}\] In the output below we can see the observations that most strongly affect our linear regression.

The largest studentized residual is about -29.98 (observation 2569), and absolute values larger than 3 are commonly flagged as outliers.

car::outlierTest(sqftLM)
##        rstudent unadjusted p-value Bonferonni p
## 2569 -29.975974        5.0079e-170  1.3151e-166
## 2568 -20.379881         7.9436e-86   2.0860e-82
## 2567 -13.794873         7.6305e-42   2.0038e-38
## 932   12.243080         1.4805e-33   3.8878e-30
## 732    8.857920         1.4706e-18   3.8618e-15
## 1129   6.169128         7.9291e-10   2.0822e-06
## 2022  -5.978266         2.5615e-09   6.7265e-06
## 890    5.706572         1.2821e-08   3.3667e-05
## 735    5.531466         3.4896e-08   9.1636e-05
## 736    5.527662         3.5652e-08   9.3621e-05
car::leveragePlots(sqftLM)
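The same rule of thumb can be seen with base R’s rstudent function; below is a sketch with a deliberately injected outlier in simulated data:

```r
set.seed(7)
x <- 1:30
y <- 2 * x + rnorm(30)
y[30] <- y[30] + 40        # inject a gross outlier

fit <- lm(y ~ x)
rs <- rstudent(fit)        # externally studentized residuals

which.max(abs(rs))         # observation 30 stands out
abs(rs[30]) > 3            # TRUE: flagged as an outlier
```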

Influential Observations

The influencePlot function creates a “bubble” plot of Studentized residuals versus hat values, with the areas of the circles proportional to Cook’s distance. Vertical reference lines are drawn at twice and three times the average hat value, and horizontal reference lines at -2, 0, and 2 on the Studentized-residual scale.

cutoff <- 4/((nrow(housing)-length(sqftLM$coefficients)-2)) 
plot(sqftLM, which=4, cook.levels=cutoff)

car::influencePlot(sqftLM, id.method="identify", main="Influence Plot", sub="Circle size is proportional to Cook's Distance" )

##        StudRes       Hat    CookD
## 2568 -20.37988 0.1284836 26.44070
## 2569 -29.97597 0.1982310 82.76901

Non-Normality

The qqPlot function plots empirical quantiles of a variable, or of studentized residuals from a linear model, against theoretical quantiles of a comparison distribution.

car::qqPlot(sqftLM, main="QQ Plot")

## [1] 2568 2569
sresid <- MASS::studres(sqftLM)
data.frame(sresid) %>% ggplot(aes(x =sresid)) + 
    geom_histogram(binwidth = 0.5)  + theme_minimal()

Homoscedasticity

The spreadLevelPlot function creates plots for examining the possible dependence of spread on level, or an extension of these plots to the studentized residuals from linear models. We’re interested in whether the variability of a variable is unequal across the range of values of a second variable that predicts it.

# Evaluate homoscedasticity
# non-constant error variance test
car::ncvTest(sqftLM)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 114180.4, Df = 1, p = < 2.22e-16
# plot studentized residuals vs. fitted values 
car::spreadLevelPlot(sqftLM)

## 
## Suggested power transformation:  0.2425346

Nonlinearity

crPlots constructs component+residual plots, also called partial-residual plots, for linear and generalized linear models.

This is a good way to see if the predictors have a linear relationship to the dependent variable. A partial-residual plot models the residuals of one predictor against the dependent variable. A component-residual plot adds a line indicating where the line of best fit lies. A significant difference between the residual line and the component line indicates that the predictor does not have a linear relationship with the dependent variable.

car::crPlots(sqftLM)

Non-Independence of Errors

The Durbin-Watson statistic is used to detect the presence of autocorrelation at lag 1 in the residuals (prediction errors) from a regression analysis. Remember that we want our errors to be independent. The Durbin-Watson statistic always has a value between 0 and 4. A value of 2.0 means that no autocorrelation was detected in the sample. Values from 0 to 2 indicate positive autocorrelation, and values from 2 to 4 indicate negative autocorrelation.

Positive autocorrelation means that if the previous value declines, the next value will also decline, while a negative autocorrelation has the opposite effect.
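The statistic itself is easy to compute by hand: \(DW = \sum(e_t - e_{t-1})^2 / \sum e_t^2\). A base-R sketch on simulated error series (not on our housing model):

```r
set.seed(3)
e_indep <- rnorm(200)                                        # independent errors
e_ar    <- as.numeric(stats::filter(rnorm(200), 0.8,
                                    method = "recursive"))   # AR(1) errors, rho = 0.8

dw <- function(e) sum(diff(e)^2) / sum(e^2)

dw(e_indep)  # near 2: no autocorrelation
dw(e_ar)     # well below 2: positive autocorrelation
```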

# Test for Autocorrelated Errors
car::durbinWatsonTest(sqftLM)
##  lag Autocorrelation D-W Statistic p-value
##    1       0.5187698     0.9621784       0
##  Alternative hypothesis: rho != 0

Plotting them together

layout(matrix(c(1,2,3,4),2,2))
plot(sqftLM)


Things to look out for

  • residuals must have a mean of zero
  • residuals are not autocorrelated - Durbin-Watson test
  • normality of errors
  • more observations than variables
  • no excessive multicollinearity

Multiple Regression

What happens when you want to include more predictor variables? You fit a model of the form \[ Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \text{Error}\]

Specification Errors

You have to make sure all the relevant variables are in the model, or else you will have systematic error when we want our error to be random.

Challenges

  • visual examination becomes difficult
  • multicollinearity
  • interactions
  • attributing importance to each variable

What happens if you have more than one predictor? You’ll have to use multiple regression.

First we’ll have to explore our data. Let’s start with the distribution of ValuePerSqFt.

ggplot(housing, aes(x = ValuePerSqFt)) + geom_histogram()

There seems to be a bimodal distribution with peaks at around 80 and 200. We can explore further by mapping colors with boroughs. We can see that Manhattan and Brooklyn accounted for the main peaks exhibiting different distributions.

ggplot(housing, aes(x = ValuePerSqFt, fill = Boro)) + geom_histogram(binwidth = 10) + 
    facet_wrap(~ Boro)

How about other variables? what does square footage or units look like?

We can see that for sq footage, there are some with extremely large area. For the number of units, it follows the same distribution, where most of the sales are in the lower ranges and a few outliers with large numbers.

ggplot(housing, aes(SqFt)) + geom_histogram()

ggplot(housing, aes(Units)) + geom_histogram()

If we plot SqFt and Units against ValuePerSqFt, we can see if we can trim some outliers.

ggplot(housing, aes(SqFt, ValuePerSqFt)) + geom_point()

ggplot(housing, aes(Units, ValuePerSqFt)) + geom_point()

Based on the previous graphs, we can trim the data to buildings with fewer than 1,000 units.

ggplot(housing %>% filter(Units < 1000), aes(Units, ValuePerSqFt)) + geom_point()

housing <- housing %>% filter(Units < 1000)

Now let’s fit a model condo1 with our response variable being ValuePerSqFt and predictor variables Units, SqFt, and Boro.

housingFact <- housing %>% mutate(Boro = as.factor(Boro))
condo1 <- lm(ValuePerSqFt ~ Units + SqFt + Boro, data = housingFact)

We can view the summary of the model

summary(condo1)
## 
## Call:
## lm(formula = ValuePerSqFt ~ Units + SqFt + Boro, data = housing)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -168.458  -22.680    1.493   26.290  261.761 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        4.430e+01  5.342e+00   8.293  < 2e-16 ***
## Units             -1.532e-01  2.421e-02  -6.330 2.88e-10 ***
## SqFt               2.070e-04  2.129e-05   9.723  < 2e-16 ***
## BoroBrooklyn       3.258e+01  5.561e+00   5.858 5.28e-09 ***
## BoroManhattan      1.274e+02  5.459e+00  23.343  < 2e-16 ***
## BoroQueens         3.011e+01  5.711e+00   5.272 1.46e-07 ***
## BoroStaten Island -7.114e+00  1.001e+01  -0.711    0.477    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 43.2 on 2613 degrees of freedom
## Multiple R-squared:  0.6034, Adjusted R-squared:  0.6025 
## F-statistic: 662.6 on 6 and 2613 DF,  p-value: < 2.2e-16

And we can view only the coefficients:

condo1$coefficients
##       (Intercept)             Units              SqFt      BoroBrooklyn 
##      4.430325e+01     -1.532405e-01      2.069727e-04      3.257554e+01 
##     BoroManhattan        BoroQueens BoroStaten Island 
##      1.274259e+02      3.011000e+01     -7.113688e+00

With the coefplot package we can plot the coefficients of our model.

library(coefplot)
coefplot(condo1, sort = 'mag')

We can now see what difference scaling Units and SqFt makes.

condo2 <- lm(ValuePerSqFt ~ scale(Units) + scale(SqFt) + Boro, data = housing)
summary(condo2)
## 
## Call:
## lm(formula = ValuePerSqFt ~ scale(Units) + scale(SqFt) + Boro, 
##     data = housing)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -168.458  -22.680    1.493   26.290  261.761 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         50.459      5.320   9.484  < 2e-16 ***
## scale(Units)       -14.142      2.234  -6.330 2.88e-10 ***
## scale(SqFt)         21.949      2.257   9.723  < 2e-16 ***
## BoroBrooklyn        32.576      5.561   5.858 5.28e-09 ***
## BoroManhattan      127.426      5.459  23.343  < 2e-16 ***
## BoroQueens          30.110      5.711   5.272 1.46e-07 ***
## BoroStaten Island   -7.114     10.008  -0.711    0.477    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 43.2 on 2613 degrees of freedom
## Multiple R-squared:  0.6034, Adjusted R-squared:  0.6025 
## F-statistic: 662.6 on 6 and 2613 DF,  p-value: < 2.2e-16
coefplot(condo2, sort = 'mag')

multiplot(condo1, condo2)

Interaction effects

Sometimes two variables interact, and their combination is an important feature to capture. In R, you use either * or : to create combinations of predictor variables. The difference between the two is that * keeps the original terms along with the interaction, while : keeps only the interaction terms and not the original terms.
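A quick way to see the difference is to inspect the model matrix each operator produces; a small sketch on toy data:

```r
d <- data.frame(x = 1:6, g = factor(rep(c("a", "b"), 3)))

# '*' expands to main effects plus the interaction
colnames(model.matrix(~ x * g, d))
# "(Intercept)" "x" "gb" "x:gb"

# ':' keeps only the interaction term
colnames(model.matrix(~ x:g, d))
# "(Intercept)" "x:ga" "x:gb"
```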

condo2 <- lm(ValuePerSqFt ~ Units + SqFt * Boro + YearBuilt, data = housingFact)
summary(condo2)
## 
## Call:
## lm(formula = ValuePerSqFt ~ Units + SqFt * Boro + YearBuilt, 
##     data = housingFact)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -176.87  -20.38    0.94   24.04  253.02 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -5.890e+02  4.645e+01 -12.681  < 2e-16 ***
## Units                  -1.391e-01  2.393e-02  -5.812 6.94e-09 ***
## SqFt                    9.070e-05  9.442e-05   0.961 0.336840    
## BoroBrooklyn            3.041e+01  7.809e+00   3.895 0.000101 ***
## BoroManhattan           1.319e+02  7.766e+00  16.982  < 2e-16 ***
## BoroQueens              2.957e+01  7.982e+00   3.705 0.000216 ***
## BoroStaten Island      -6.308e+00  1.685e+01  -0.374 0.708084    
## YearBuilt               3.220e-01  2.307e-02  13.960  < 2e-16 ***
## SqFt:BoroBrooklyn       2.345e-05  9.606e-05   0.244 0.807153    
## SqFt:BoroManhattan      1.028e-04  9.268e-05   1.109 0.267565    
## SqFt:BoroQueens         1.619e-05  9.613e-05   0.168 0.866281    
## SqFt:BoroStaten Island  3.095e-05  1.756e-04   0.176 0.860101    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 41.88 on 2512 degrees of freedom
##   (96 observations deleted due to missingness)
## Multiple R-squared:  0.6317, Adjusted R-squared:  0.6301 
## F-statistic: 391.7 on 11 and 2512 DF,  p-value: < 2.2e-16
coefplot(condo2, sort = 'mag', coefficients = condo2$coefficients[-1] %>% names())

Here is an example using the : to remove the original variables and keeping only the interactions.

condo3 <- lm(ValuePerSqFt ~ Units + SqFt:Boro + YearBuilt, data = housingFact)
summary(condo3)
## 
## Call:
## lm(formula = ValuePerSqFt ~ Units + SqFt:Boro + YearBuilt, data = housingFact)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -250.47  -40.83  -10.74   39.10  263.01 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             3.364e+02  5.485e+01   6.134 9.92e-10 ***
## Units                  -1.325e-01  3.195e-02  -4.147 3.48e-05 ***
## YearBuilt              -1.076e-01  2.790e-02  -3.855 0.000119 ***
## SqFt:BoroBronx         -5.620e-04  8.922e-05  -6.299 3.52e-10 ***
## SqFt:BoroBrooklyn      -2.088e-04  4.069e-05  -5.131 3.09e-07 ***
## SqFt:BoroManhattan      3.913e-04  2.744e-05  14.257  < 2e-16 ***
## SqFt:BoroQueens        -2.096e-04  4.355e-05  -4.812 1.58e-06 ***
## SqFt:BoroStaten Island -5.489e-04  1.180e-04  -4.654 3.43e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 55.91 on 2516 degrees of freedom
##   (96 observations deleted due to missingness)
## Multiple R-squared:  0.3426, Adjusted R-squared:  0.3408 
## F-statistic: 187.3 on 7 and 2516 DF,  p-value: < 2.2e-16
coefplot(condo3, sort = 'mag', coefficients = condo3$coefficients[-1] %>% names())

Here is an example of adding more interaction terms.

condo4 <- lm(ValuePerSqFt ~ Units + SqFt*Boro + SqFt*Class +  YearBuilt, data = housingFact)
summary(condo4)
## 
## Call:
## lm(formula = ValuePerSqFt ~ Units + SqFt * Boro + SqFt * Class + 
##     YearBuilt, data = housingFact)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -168.699  -20.178   -1.007   23.106  249.155 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              -5.741e+02  4.645e+01 -12.360  < 2e-16 ***
## Units                    -7.507e-02  2.427e-02  -3.093 0.002005 ** 
## SqFt                     -3.817e-05  1.175e-04  -0.325 0.745261    
## BoroBrooklyn              2.642e+01  7.642e+00   3.458 0.000554 ***
## BoroManhattan             1.252e+02  7.721e+00  16.212  < 2e-16 ***
## BoroQueens                2.414e+01  7.817e+00   3.089 0.002032 ** 
## BoroStaten Island        -1.553e+01  1.648e+01  -0.943 0.346003    
## ClassR4-CONDOMINIUM       1.242e+01  3.050e+00   4.073 4.78e-05 ***
## ClassR9-CONDOMINIUM       1.281e+01  4.808e+00   2.664 0.007777 ** 
## ClassRR-CONDOMINIUM      -4.421e+01  8.150e+00  -5.424 6.38e-08 ***
## YearBuilt                 3.128e-01  2.314e-02  13.515  < 2e-16 ***
## SqFt:BoroBrooklyn         1.897e-05  9.400e-05   0.202 0.840089    
## SqFt:BoroManhattan        1.288e-04  9.096e-05   1.416 0.156966    
## SqFt:BoroQueens           3.153e-05  9.397e-05   0.335 0.737295    
## SqFt:BoroStaten Island    2.813e-05  1.714e-04   0.164 0.869688    
## SqFt:ClassR4-CONDOMINIUM  7.053e-05  7.566e-05   0.932 0.351322    
## SqFt:ClassR9-CONDOMINIUM -5.208e-05  7.810e-05  -0.667 0.504928    
## SqFt:ClassRR-CONDOMINIUM  1.578e-04  8.514e-05   1.853 0.064040 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 40.75 on 2506 degrees of freedom
##   (96 observations deleted due to missingness)
## Multiple R-squared:  0.6521, Adjusted R-squared:  0.6498 
## F-statistic: 276.3 on 17 and 2506 DF,  p-value: < 2.2e-16
coefplot(condo4, sort = 'mag', coefficients = condo4$coefficients[-1] %>% names())

Stepwise Variable Selection

Stepwise regression consists of iteratively adding or removing predictors in order to find the subset of variables that produces the best model. The criterion for choosing the best model is usually the lowest prediction error, often measured by RMSE.

There are three ways stepwise selection can be operationalized:

  • Forward starts with no predictors and iteratively adds the most predictive variable. It stops when the improvement is no longer statistically significant.

  • Backward starts with all the predictors and iteratively removes the least predictive variable. This again stops when there is no longer improvement in the model.

  • Stepwise (both) is a combination of forward and backward selection. You start with no variables like the forward method, but you also remove any variables that no longer provide improvement, like backward selection.

Criticism

Some view this method as mere data dredging and a poor substitute for real subject-matter expertise.

Akaike Information Criterion

Another commonly used metric is the AIC, an estimator of the relative quality of statistical models for a given set of data. It is defined as \[\text{AIC} = 2k - 2\ln(\hat{L})\] where \(\hat{L}\) is the maximum value of the likelihood function for the model and \(k\) is the number of estimated parameters in the model. The model with the lowest AIC value is the preferred model. As you can see, AIC rewards goodness of fit through the likelihood function and penalizes an increasing number of parameters. In a simple linear regression there are three parameters: \(\beta_0\) (the intercept), \(\beta_1\) (the coefficient), and the variance of the Gaussian errors. For any least-squares model the residuals are expected to be random and normally distributed, and their variance is counted as one of the parameters.
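We can check the formula against R’s AIC function; note that logLik on an lm counts the error variance as a parameter, so \(k\) is the number of coefficients plus one. A minimal sketch on simulated data:

```r
set.seed(5)
x <- rnorm(80)
y <- 1 + 2 * x + rnorm(80)
fit <- lm(y ~ x)

k <- length(coef(fit)) + 1           # intercept, slope, and the error variance
aic_manual <- 2 * k - 2 * as.numeric(logLik(fit))

all.equal(aic_manual, AIC(fit))      # TRUE
```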

Here is an example of running through the different permutations of each step.

nullModel <- lm(ValuePerSqFt ~ 1, data = housingFact %>% filter(complete.cases(.)))

step(nullModel,
     scope = list(lower = nullModel, upper = condo4),
     direction = "both")
## Start:  AIC=21364.55
## ValuePerSqFt ~ 1
## 
##             Df Sum of Sq      RSS   AIC
## + Boro       4   6878123  5085328 19213
## + SqFt       1   1236641 10726810 21091
## + Class      3   1247951 10715500 21092
## + Units      1    716484 11246967 21211
## + YearBuilt  1    113894 11849557 21342
## <none>                   11963451 21364
## 
## Step:  AIC=19213.27
## ValuePerSqFt ~ Boro
## 
##             Df Sum of Sq      RSS   AIC
## + YearBuilt  1    475055  4610273 18968
## + Class      3    250325  4835003 19092
## + SqFt       1    188347  4896981 19120
## + Units      1     85383  4999945 19172
## <none>                    5085328 19213
## - Boro       4   6878123 11963451 21364
## 
## Step:  AIC=18967.74
## ValuePerSqFt ~ Boro + YearBuilt
## 
##             Df Sum of Sq      RSS   AIC
## + Class      3    201095  4409177 18861
## + SqFt       1    105177  4505096 18912
## + Units      1     36715  4573557 18950
## <none>                    4610273 18968
## - YearBuilt  1    475055  5085328 19213
## - Boro       4   7239284 11849557 21342
## 
## Step:  AIC=18861.17
## ValuePerSqFt ~ Boro + YearBuilt + Class
## 
##             Df Sum of Sq      RSS   AIC
## + SqFt       1    110040  4299137 18799
## + Units      1     51735  4357443 18833
## <none>                    4409177 18861
## - Class      3    201095  4610273 18968
## - YearBuilt  1    425825  4835003 19092
## - Boro       4   6165631 10574808 21061
## 
## Step:  AIC=18799.38
## ValuePerSqFt ~ Boro + YearBuilt + Class + SqFt
## 
##              Df Sum of Sq     RSS   AIC
## + SqFt:Class  3     73770 4225367 18762
## + SqFt:Boro   4     52406 4246731 18776
## + Units       1     37092 4262045 18780
## <none>                    4299137 18799
## - SqFt        1    110040 4409177 18861
## - Class       3    205959 4505096 18912
## - YearBuilt   1    354111 4653248 18997
## - Boro        4   5449763 9748900 20858
## 
## Step:  AIC=18761.69
## ValuePerSqFt ~ Boro + YearBuilt + Class + SqFt + Class:SqFt
## 
##              Df Sum of Sq     RSS   AIC
## + SqFt:Boro   4     47787 4177580 18741
## + Units       1     22666 4202701 18750
## <none>                    4225367 18762
## - Class:SqFt  3     73770 4299137 18799
## - YearBuilt   1    358309 4583676 18965
## - Boro        4   5356184 9581550 20820
## 
## Step:  AIC=18740.99
## ValuePerSqFt ~ Boro + YearBuilt + Class + SqFt + Class:SqFt + 
##     Boro:SqFt
## 
##              Df Sum of Sq     RSS   AIC
## + Units       1     15885 4161695 18733
## <none>                    4177580 18741
## - Boro:SqFt   4     47787 4225367 18762
## - Class:SqFt  3     69151 4246731 18776
## - YearBuilt   1    300373 4477952 18914
## 
## Step:  AIC=18733.37
## ValuePerSqFt ~ Boro + YearBuilt + Class + SqFt + Units + Class:SqFt + 
##     Boro:SqFt
## 
##              Df Sum of Sq     RSS   AIC
## <none>                    4161695 18733
## - Units       1     15885 4177580 18741
## - Boro:SqFt   4     41006 4202701 18750
## - Class:SqFt  3     57437 4219132 18762
## - YearBuilt   1    303327 4465022 18909
## 
## Call:
## lm(formula = ValuePerSqFt ~ Boro + YearBuilt + Class + SqFt + 
##     Units + Class:SqFt + Boro:SqFt, data = housingFact %>% filter(complete.cases(.)))
## 
## Coefficients:
##              (Intercept)              BoroBrooklyn  
##               -5.741e+02                 2.642e+01  
##            BoroManhattan                BoroQueens  
##                1.252e+02                 2.414e+01  
##        BoroStaten Island                 YearBuilt  
##               -1.553e+01                 3.128e-01  
##      ClassR4-CONDOMINIUM       ClassR9-CONDOMINIUM  
##                1.242e+01                 1.281e+01  
##      ClassRR-CONDOMINIUM                      SqFt  
##               -4.421e+01                -3.817e-05  
##                    Units  ClassR4-CONDOMINIUM:SqFt  
##               -7.507e-02                 7.053e-05  
## ClassR9-CONDOMINIUM:SqFt  ClassRR-CONDOMINIUM:SqFt  
##               -5.208e-05                 1.577e-04  
##        BoroBrooklyn:SqFt        BoroManhattan:SqFt  
##                1.897e-05                 1.288e-04  
##          BoroQueens:SqFt    BoroStaten Island:SqFt  
##                3.153e-05                 2.813e-05

The relaimpo package shows us the relative importance of variables for our original model.

library(relaimpo)

crlm <- calc.relimp(condo1,
            type = c("lmg", "last", "first"), 
            rela = TRUE )
plot(crlm)

Example with cars data

mpg <- mpg %>% mutate_if(is.character, ~as.factor(.x))
carsLm <- lm(hwy~displ+year+cyl+trans+drv+fl+class, mpg)
stepMod <- MASS::stepAIC(carsLm, direction = "both")
## Start:  AIC=312.83
## hwy ~ displ + year + cyl + trans + drv + fl + class
## 
##         Df Sum of Sq     RSS    AIC
## - trans  9     21.02  740.50 301.57
## <none>                719.48 312.83
## - displ  1     30.02  749.50 320.40
## - cyl    1     48.82  768.30 326.19
## - drv    2    114.03  833.51 343.26
## - year   1    109.35  828.83 343.94
## - class  6    420.05 1139.53 408.43
## - fl     4    569.49 1288.98 441.27
## 
## Step:  AIC=301.57
## hwy ~ displ + year + cyl + drv + fl + class
## 
##         Df Sum of Sq     RSS    AIC
## <none>                740.50 301.57
## - displ  1     33.98  774.48 310.07
## + trans  9     21.02  719.48 312.83
## - cyl    1     54.72  795.22 316.25
## - drv    2    127.95  868.46 334.87
## - year   1    139.32  879.82 339.91
## - class  6    474.55 1215.06 405.45
## - fl     4    580.57 1321.08 429.03
stepMod$anova
## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## hwy ~ displ + year + cyl + trans + drv + fl + class
## 
## Final Model:
## hwy ~ displ + year + cyl + drv + fl + class
## 
## 
##      Step Df Deviance Resid. Df Resid. Dev      AIC
## 1                           209   719.4808 312.8309
## 2 - trans  9 21.02381       218   740.5047 301.5705
carsim <- calc.relimp(carsLm,
            type = c("lmg", "last", "first"), 
            rela = TRUE )
plot(carsim)

Generalized Linear Models

The glm function can fit many types of generalized linear models; the model type is chosen through the family argument (for example, binomial for logistic regression, poisson for count data, and gaussian for ordinary linear regression).

Logistic Regression

housingBoro <- housing %>% filter(Boro %in% c("Manhattan", "Brooklyn")) %>% 
    mutate(Boro = as.factor(Boro)) %>% 
    dplyr::select(-c("Neighborhood", "Class"))
Boro1 <- glm(Boro ~. , data = housingBoro,
          family = binomial(link="logit"))
summary(Boro1)
## 
## Call:
## glm(formula = Boro ~ ., family = binomial(link = "logit"), data = housingBoro)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6332  -0.2926   0.0409   0.1444   3.7449  
## 
## Coefficients: (1 not defined because of singularities)
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     3.720e+01  4.341e+00   8.569  < 2e-16 ***
## Units           3.902e-03  4.320e-03   0.903   0.3664    
## YearBuilt      -2.334e-02  2.182e-03 -10.700  < 2e-16 ***
## SqFt            1.765e-06  7.084e-06   0.249   0.8032    
## Income          5.480e-07  1.120e-06   0.489   0.6246    
## IncomePerSqFt   5.163e-01  8.602e-02   6.002 1.95e-09 ***
## Expense        -1.451e-07  1.461e-06  -0.099   0.9209    
## ExpensePerSqFt -2.741e-01  1.203e-01  -2.279   0.0227 *  
## NetIncome              NA         NA      NA       NA    
## Value          -1.127e-07  1.428e-07  -0.789   0.4301    
## ValuePerSqFt   -1.112e-02  1.042e-02  -1.068   0.2856    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2487.42  on 2001  degrees of freedom
## Residual deviance:  749.86  on 1992  degrees of freedom
##   (92 observations deleted due to missingness)
## AIC: 769.86
## 
## Number of Fisher Scoring iterations: 8
coefplot(Boro1, sort = 'mag',coefficients = names(Boro1$coefficients)[-1])

Boro2 <- glm(Boro ~IncomePerSqFt , data = housingBoro,
          family = binomial)
summary(Boro2)
## 
## Call:
## glm(formula = Boro ~ IncomePerSqFt, family = binomial, data = housingBoro)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0485  -0.3954   0.0364   0.1868   3.8703  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -9.84115    0.47461  -20.73   <2e-16 ***
## IncomePerSqFt  0.41190    0.02009   20.50   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2691.29  on 2093  degrees of freedom
## Residual deviance:  976.37  on 2092  degrees of freedom
## AIC: 980.37
## 
## Number of Fisher Scoring iterations: 7
logitPlot <- data.frame(IncomePerSqFt = seq(min(housingBoro$IncomePerSqFt), max(housingBoro$IncomePerSqFt),len=nrow(housingBoro)))
logitPlot$Boro = predict(Boro2, newdata=logitPlot, type="response")

housingBoro %>% mutate(Boro = if_else(Boro == "Manhattan", 1, 0)) %>% # have to reclassify manhattan as 1
    ggplot(aes(x=IncomePerSqFt, y=Boro)) + geom_jitter(shape = 1, height = 0.005, alpha = .25) + 
    geom_line(data = logitPlot, aes(x = IncomePerSqFt, y=Boro), colour = "red") +
    theme_minimal()
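The fitted curve above is just the inverse logit of the linear predictor; predict(type = "response") applies plogis for us. A self-contained sketch on simulated data:

```r
set.seed(13)
x <- rnorm(100)
y <- rbinom(100, 1, plogis(-1 + 2 * x))     # simulate a binary outcome

fit <- glm(y ~ x, family = binomial(link = "logit"))

# type = "response" is the inverse logit of the linear predictor
eta <- coef(fit)[1] + coef(fit)[2] * x
all.equal(unname(plogis(eta)),
          unname(predict(fit, type = "response")))  # TRUE
```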

glmnet

glmnet is a package that fits generalized linear models via penalized maximum likelihood. The regularization path is computed for the lasso or elastic-net penalty at a grid of values for the regularization parameter \(\lambda\).

\[\text{min}_{\beta_0, \beta}\frac{1}{N}\sum ^N _{i=1}w_il(y_i,\beta_0+\beta^Tx_i) +\lambda[(1-\alpha)||\beta||_2^2/2+\alpha||\beta||_1]\]

Regularization

If you recall the concepts of overfitting and parsimony, simplifying models as much as possible is beneficial. Regularization is a popular technique for avoiding overfitting; it adds another term to the loss function.

Ridge Regression

\[L = \Sigma(\hat{Y_i}-Y_i)^2 + \lambda\Sigma\beta^2\]

This loss function has two parts: the first is the sum of squared errors, and the second sums the squared \(\beta\)s and multiplies that sum by a new term \(\lambda\). We want to penalize the model for large coefficient values.
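Ridge has a closed-form solution, \(\hat\beta = (X^TX + \lambda I)^{-1}X^Ty\), which makes the shrinkage easy to see on simulated data (an illustrative sketch, not how glmnet computes it):

```r
set.seed(9)
n <- 100; p <- 5
X <- matrix(rnorm(n * p), n, p)
y <- as.numeric(X %*% c(3, -2, 0, 0, 1) + rnorm(n))

# closed-form ridge estimate: (X'X + lambda I)^{-1} X'y
ridge <- function(X, y, lambda) solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)

l2 <- function(b) sqrt(sum(b^2))
l2(ridge(X, y, 0.1))   # close to the OLS solution
l2(ridge(X, y, 100))   # noticeably shrunk toward zero
```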

Lasso Method

This loss function is slightly different from the previous method:

\[L = \Sigma(\hat{Y_i}-Y_i)^2 + \lambda\Sigma|\beta|\]

where you can see the \(\beta\) is now inside an absolute value. Unlike the Ridge method, where coefficients are pushed toward zero, the Lasso will set irrelevant coefficients exactly to zero.
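To see this zeroing behavior without pulling in glmnet, here is a bare-bones coordinate-descent lasso on simulated, standardized data (an illustrative sketch, not glmnet’s actual implementation):

```r
set.seed(11)
n <- 200; p <- 5
X <- scale(matrix(rnorm(n * p), n, p))
y <- as.numeric(X %*% c(4, -3, 0, 0, 0) + rnorm(n))   # only two real signals
y <- y - mean(y)

soft <- function(z, g) sign(z) * max(abs(z) - g, 0)   # soft-thresholding operator

lasso_cd <- function(X, y, lambda, iters = 100) {
  n <- nrow(X)
  beta <- rep(0, ncol(X))
  for (it in seq_len(iters)) {
    for (j in seq_along(beta)) {
      r_j <- y - X[, -j, drop = FALSE] %*% beta[-j]   # partial residual for column j
      beta[j] <- soft(sum(X[, j] * r_j) / n, lambda) / (sum(X[, j]^2) / n)
    }
  }
  beta
}

b <- lasso_cd(X, y, lambda = 1)
round(b, 3)      # the three irrelevant coefficients are exactly zero
sum(b == 0)      # 3
```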